September 18, 2016

Wrangling Data

Agenda

  • Data Import

  • Tidying Data

  • Manipulating Data

Data Import

The tidyverse includes a number of easy-to-use packages for importing data:

  • readr for text files (including .csv and .tsv files)
  • readxl for Excel files
  • haven for Stata, SPSS, and SAS files
  • and, of course, base R’s load() for .RData/.rda files

The rio package provides a one-size-fits-all shortcut: import()
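
As a minimal sketch of the difference, the same file can be read with an explicit readr call or with rio's import(), which picks a reader from the file extension. The temporary file and its toy contents are made up for illustration and are not part of the course data.

```r
library(readr)  # read_csv() for delimited text files
library(rio)    # import() chooses a reader from the file extension

# Write a small CSV to a temporary file so the example is self-contained
# (the columns and values are toy data, invented for this sketch)
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:2, value = c("a", "b")),
          csv_path, row.names = FALSE)

via_readr <- read_csv(csv_path)  # explicit reader: you choose the function
via_rio   <- import(csv_path)    # shortcut: reader inferred from ".csv"
```

Both calls should return the same two rows; the practical difference is that import() saves you from remembering which package handles which format.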

Data Access

Getting data into R requires us to have data.

  • For our own survey or experiment, we can just publish the data

  • For others’ data, though, we should provide replicable access

Data Access with a Static Direct Link
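
When a file sits behind a static direct link, base R's download.file() is enough to fetch it reproducibly. The sketch below uses a local file:// URL as a stand-in for a real link so that it runs without a network connection; the URL, the file name example.csv, and the temporary directory are assumptions for illustration only, and the path construction assumes a Unix-like system.

```r
# Stand-in for a remote file: a local file addressed with a file:// URL
src <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), src)
static_url <- paste0("file://", normalizePath(src, winslash = "/"))

# Download into a project subfolder (created here inside tempdir())
dest_dir <- file.path(tempdir(), "cmcr04_files")
dir.create(dest_dir, showWarnings = FALSE)
dest <- file.path(dest_dir, "example.csv")

# mode = "wb" keeps binary formats such as .xls intact on Windows
download.file(static_url, destfile = dest, mode = "wb")
file.exists(dest)
```

With a real link, only static_url and the destination file name would change.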

Data Access with Login

  • Many data archives request user information, either to restrict access or simply to generate data to justify their funding
  • Even if links are static, they are no longer direct: passing a link to download.file() won’t work
  • For many frequently accessed sites, there are packages that provide a solution

Data Access with Login: gesis

library(gesis)
fs <- login(username = "frederick-solt@uiowa.edu", 
            password = "not_my_real_password!") 
download_dataset(s = fs, 
                 doi = "6643", 
                 path = "cmcr04_files",
                 purpose = 1) # "1. for scientific research"
## Downloading DOI: 6643
list.files(path = "cmcr04_files", pattern="ZA.*")
## [1] "ZA6643_v2-0-0.dta"

Data Access with Login: gesis

Note that you should actually save your username and password as the options “gesis_user” and “gesis_pass” in your .Rprofile, rather than typing them into your script, to keep your information private.

library(gesis)
fs <- login(getOption("gesis_user"), getOption("gesis_pass"))
download_dataset(s = fs, 
                 doi = "6643", 
                 path = "cmcr04_files",
                 purpose = 1) # "1. for scientific research"
## Downloading DOI: 6643
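
A minimal sketch of how the stored options work: options() sets them (normally from within .Rprofile, so they never appear in the analysis script) and getOption() reads them back. The credentials shown are placeholders, not a real account.

```r
# Normally these lines live in ~/.Rprofile, not in the analysis script
options(gesis_user = "user@example.org",     # placeholder, not a real account
        gesis_pass = "placeholder-password") # placeholder

user <- getOption("gesis_user")
pass <- getOption("gesis_pass")

# getOption() returns NULL for an unset option, so fail early if one is missing
stopifnot(!is.null(user), !is.null(pass))
```

Because the script only ever calls getOption(), it can be shared or published without exposing the login details.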

Data Access: A Few More Packages

Transforming Data

  • filter()
  • arrange()
  • select()
  • mutate() / transmute()
  • group_by() + summarize()
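
The verbs above can be chained with the pipe. This is a small sketch on a made-up data frame (the scores table and its columns are hypothetical), not one of the course datasets.

```r
library(dplyr)

# Toy data, invented for illustration
scores <- tibble(
  group = c("a", "a", "b", "b"),
  x     = c(1, 4, 2, 8),
  y     = c(10, 20, 30, 40)
)

result <- scores %>%
  filter(x > 1) %>%                    # keep rows where x exceeds 1
  arrange(desc(x)) %>%                 # sort rows, largest x first
  select(group, x, y) %>%              # keep only the named columns
  mutate(ratio = y / x) %>%            # add a new column
  group_by(group) %>%                  # one summary row per group
  summarize(mean_ratio = mean(ratio))  # collapse each group to its mean
```

transmute() works like mutate() but keeps only the columns it creates.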

Tidying Data

Remember that Australian Excel file we grabbed with the static link?

It’s got a problem: it isn’t tidy.

This means that it won’t play well with others, so we’ll have to tidy it before use.

Tidying Data

library(rio)
ineq <- import("cmcr04_files/abs.xls", sheet = "Table 1.1", skip = 4)
View(ineq)

Tidying Data

First step: give names to those first two columns!

names(ineq)
##  [1] "..1"        "..2"        "1994–95"    "1995–96"   
##  [5] "1996–97"    "1997–98"    "1999–2000"  "2000–01"   
##  [9] "2002–03"    "2003–04(a)" "2005–06(a)" "2007–08(a)"
## [13] "2009–10(a)" "2011–12(a)" "2013–14(a)"
names(ineq)[1:2] <- c("var", "unit")
names(ineq)
##  [1] "var"        "unit"       "1994–95"    "1995–96"   
##  [5] "1996–97"    "1997–98"    "1999–2000"  "2000–01"   
##  [9] "2002–03"    "2003–04(a)" "2005–06(a)" "2007–08(a)"
## [13] "2009–10(a)" "2011–12(a)" "2013–14(a)"

Tidying Data

Second step: gather the (Gini) data

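library(dplyr) # for %>%, filter(), mutate(), and select()
library(tidyr) # for gather()
gathered <- ineq %>% 
    filter(var == "Gini coefficient") %>%                         # keep only the Gini rows
    mutate(var = ifelse(unit == "RSE(%)", "gini_rse", "gini")) %>% # label estimate vs. RSE
    select(-unit) %>% 
    gather(key = year, value = val, -var)                         # one row per var-year pair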
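
To see what gather() does in isolation, here is a sketch on a toy wide table shaped like the Gini rows: one column per year. The table and its values are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one column per year (values are made up)
wide <- tibble(var    = c("gini", "gini_rse"),
               `2001` = c("0.30", "1.5"),
               `2002` = c("0.31", "1.4"))

# Every column except var is stacked into (year, val) pairs
long <- wide %>%
  gather(key = year, value = val, -var)
```

The two year columns become four rows, each identified by var and year.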

Tidying Data

Third step: spread the data again

spreaded <- gathered %>% 
    spread(key = var, value = val)

It’s tidy now!
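
As a self-contained sketch of the spread step, the toy long table below (made-up values) is widened so that each value of var becomes its own column, leaving one row per year.

```r
library(dplyr)
library(tidyr)

# Toy long table: one row per var-year pair (values are made up)
long <- tibble(var  = c("gini", "gini_rse", "gini", "gini_rse"),
               year = c("2001", "2001", "2002", "2002"),
               val  = c("0.30", "1.5", "0.31", "1.4"))

# Each distinct value of var becomes a column filled from val
wide_again <- long %>%
  spread(key = var, value = val)
```

In tidyr 1.0.0 and later, pivot_longer() and pivot_wider() are the recommended replacements for gather() and spread(); the older functions still work.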

Tidying Data

It’s tidy, but it’s still not clean. Fourth step: clean up

library(stringr) # more on stringr next week!
ineq_tidy <- spreaded %>% 
    transmute(country = "Australia",
              year = str_replace(year, "\\(a\\)", ""),
              year = ifelse(str_extract(year, "\\d{2}$") %>%
                                as.numeric() > 50,
                            str_extract(year, "\\d{2}$") %>%
                                as.numeric() + 1900,
                            str_extract(year, "\\d{2}$") %>% 
                                as.numeric() + 2000),
              gini = as.numeric(gini),
              gini_se = as.numeric(gini_rse)/100 * gini)
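
The year logic above can be checked on a few labels in isolation. This sketch applies the same steps to three column names from the spreadsheet (written with \u2013 for the en dash); the gini and gini_rse values at the end are made up to show the conversion from relative standard error to standard error.

```r
library(stringr)

labels <- c("1994\u201395", "1999\u20132000", "2003\u201304(a)")

cleaned <- str_replace(labels, "\\(a\\)", "")          # drop the "(a)" footnote marker
yy      <- as.numeric(str_extract(cleaned, "\\d{2}$")) # last two digits of the end year
year    <- ifelse(yy > 50, yy + 1900, yy + 2000)       # pivot at 50: 95 -> 1995, 4 -> 2004

# Relative standard error (in percent) to standard error
gini     <- 0.3  # made-up value
gini_rse <- 2    # made-up value, in percent
gini_se  <- gini_rse / 100 * gini
```

Each period is coded by its end year, so 1999–2000 becomes 2000.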
